We study pool-based active learning of halfspaces, in which a learner receives a pool of unlabeled
examples, and iteratively queries a teacher for the labels of examples from the pool, in order to
identify all the labels of pool examples. We revisit the idea of greedily selecting examples to label,
and use it to derive an efficient algorithm, called ALuMA, that approximates the optimal label
complexity for a given pool in $\mathbb{R}^d$. We show that ALuMA obtains an $O(d^2 \log(d))$
approximation factor if the examples in the pool are numbers with a finite accuracy. We further prove
a result for general hypothesis classes, showing that a slight change to the greedy approach leads to
an improved target-dependent guarantee on the label complexity. In particular, we conclude a better
guarantee for ALuMA if the target hypothesis has a large margin. We further compare our approach to
other common active learning strategies, and provide a theoretical and empirical evaluation of the
advantages and disadvantages of the approach.
1 Introduction

Pool-based active learning [McCallum and Nigam, 1998] is useful in many data-laden applications,
where unlabeled data is abundant but labeling is expensive. In this setting the learner receives as
input a set of instances, denoted $X = \{x_1, \ldots, x_m\}$. Each instance $x_i$ is associated with a
label $L(i)$, which is initially unknown to the learner. The learner has access to a teacher,
represented by the oracle $L : [m] \to \{-1, 1\}$. The goal of the learner is to find the values
$L(1), \ldots, L(m)$ using as few calls to $L$ as possible. We assume that $L$ is determined by a
function $h$ taken from a predefined hypothesis class $\mathcal{H}$. That is, $\exists h \in \mathcal{H}$
such that for all $i$, $L(i) = h(x_i)$. We denote this by $L \Leftarrow h$. A pool-based algorithm can
be used to learn a classifier in the standard PAC model, while querying fewer labels. We discuss this
in Section 6. We mainly deal with the hypothesis class of homogeneous halfspaces in $\mathbb{R}^d$,
namely, $X \subseteq \mathbb{R}^d$ and
$\mathcal{W} = \{x \mapsto \mathrm{sgn}(\langle w, x \rangle) : w \in \mathbb{R}^d\}$, where
$\langle w, x \rangle$ is the inner product between the vectors $w$ and $x$.
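The setting above can be made concrete with a small sketch. The pool generator, the target vector, and the `CountingOracle` class are illustrative assumptions and do not come from the paper; the only point is that the teacher is a function on indices whose calls are counted, and that labels are induced by a homogeneous halfspace.

```python
import random

def make_pool(m, d, rng):
    """Draw m points from [-1, 1]^d (an arbitrary illustrative pool)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(m)]

def halfspace_label(w, x):
    """Label given by the homogeneous halfspace with normal w (sgn(0) is taken as -1 here)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

class CountingOracle:
    """Teacher L : [m] -> {-1, +1} that records how many labels were requested."""

    def __init__(self, pool, w_target):
        self._labels = [halfspace_label(w_target, x) for x in pool]
        self.calls = 0

    def __call__(self, i):
        self.calls += 1
        return self._labels[i]

rng = random.Random(0)
pool = make_pool(m=20, d=3, rng=rng)
oracle = CountingOracle(pool, w_target=[1.0, -0.5, 0.25])  # hypothetical target
first_label = oracle(0)  # one call to the teacher
```

The label complexity of a learner in this sketch is simply the value of `oracle.calls` once all $m$ labels have been determined.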



The set of all hypotheses in $\mathcal{H}$ that are consistent with the labels currently known to the
learner is called the version space. Many active learning algorithms maintain a version space, and use
it to decide which label to query next. For example, the CAL algorithm [Cohn et al., 1994] selects an
instance at random and queries its label only if there are two hypotheses in the version space that
disagree on its label. Tong and Koller [2002] proposed a more aggressive greedy selection policy
for halfspaces: query the instance from the pool that splits the version space as evenly as possible,
in terms of volume in $\mathbb{R}^d$. To implement this policy, one would need to calculate the volume
of a convex body (the version space), which is known to be computationally intractable. Tong and
Koller implemented several heuristics that attempt to follow their proposed selection principle using
an efficient algorithm. For instance, they suggest choosing the example which is closest to the
max-margin solution of the data labeled so far. However, none of their heuristics provably follow this
greedy selection policy. Our first contribution is an efficient algorithm, which relies on randomized
approximation of the volume of the version space [e.g., Kannan et al., 1997]. This allows us to prove
that our algorithm follows an approximate greedy rule for minimizing the volume of the version space.



The proposed algorithm can be theoretically analyzed as follows. Given a pool-based active
learning algorithm $A$, denote by $N(A, h)$ the number of calls to $L$ that $A$ makes before
outputting $(L(1), \ldots, L(m))$, for $L \Leftarrow h$. The worst-case label complexity of $A$ is
defined to be $\max_{h \in \mathcal{H}} N(A, h)$ and the average-case label complexity of $A$ is defined
to be $\mathbb{E}_{h \sim P} N(A, h)$, where $P$ is some predefined distribution over the hypothesis
class $\mathcal{H}$. We denote the optimal worst-case label complexity for the given pool by
$\mathrm{OPT}_{\max}$ and the optimal average label complexity (for some fixed distribution over
hypotheses) by $\mathrm{OPT}_{\mathrm{avg}}$.



Dasgupta [2005] showed that if an exact greedy algorithm splits the probability mass of the version
space (as defined by $P$) as evenly as possible, then its average label complexity using the same $P$
is bounded by $O(\log(1/p_{\min}) \cdot \mathrm{OPT}_{\mathrm{avg}})$, where
$p_{\min} = \min_{h \in \mathcal{H}} P(h)$. Golovin and Krause [2010] extended Dasgupta's result and
showed that a similar bound holds for an approximate greedy rule. They also showed that the worst-case
label complexity of an approximate greedy rule is at most
$O(\log(1/p_{\min}) \cdot \mathrm{OPT}_{\max})$, thus extending a result of Arkin et al. [1993].



The distribution $P$ over hypotheses that matches our volume-splitting strategy is one that draws a
halfspace uniformly from the unit ball in $\mathbb{R}^d$.¹ In this case $P(h)$ is the probability mass
of all the hypotheses inducing the same labeling as $h$ on $X$. If there are instances in $X$ that are
very close to each other, then $p_{\min}$ might be very small. Our second contribution is to show that
mild conditions suffice to guarantee that $p_{\min}$ is bounded from below. In particular, by proving a
variant of a result due to Muroga et al. [1961], we show that if the examples in the pool $X$ are
stored using numbers of a finite accuracy $1/c$, then $p_{\min} \geq (c/d)^{d^2}$. It follows that the
worst-case label complexity of our algorithm is at most
$O(d^2 \log(d/c)) \cdot \mathrm{OPT}_{\max}$.



While this result provides us with a uniform lower bound on $p_{\min}$, in many real-world situations
the probability of the target hypothesis (i.e., one that is consistent with $L$) could be much larger
than $p_{\min}$. A noteworthy example is when the target hypothesis separates $X$ with a margin of
$\gamma$. In this case, it can be shown that the probability of the target hypothesis is at least
$\gamma^d$, which can be significantly larger than $p_{\min}$. An immediate question is therefore: can
we obtain a target-dependent label complexity bound of $\log(1/P(h)) \cdot \mathrm{OPT}_{\max}$, where
$h$ is the target hypothesis?



We prove that such a target-dependent bound does not hold for a general approximate greedy
algorithm. Nonetheless, in our third main contribution we show that by introducing an algorithmic
change to the approximate greedy policy, we do obtain the label complexity bound of
$\log(1/P(h)) \cdot \mathrm{OPT}_{\max}$. In particular, we run an approximate greedy procedure, but
stop the procedure early, before reaching a pure version space that exactly matches the labeling of
the pool. We then use an approximate majority vote over the version space to determine the labels of
$X$.



We also derive lower bounds, showing that the dependence of our label-complexity guarantee on
the accuracy $c$, or the margin parameter $\gamma$, is indeed necessary and is not an artifact of our
analysis. We do not know if the dependence of our bounds on $d$ is tight. It should be noted that some
of the most popular learning algorithms (e.g. SVM, Perceptron, and AdaBoost) rely on a large-margin
assumption to derive dimension-independent sample complexity guarantees. In contrast, here we
use the margin for computational reasons. Our approximation guarantee depends logarithmically on
the margin parameter, while the sample complexities of SVM, Perceptron, and AdaBoost depend
polynomially on the margin. Hence, we require a much smaller margin than these algorithms do.
Balcan et al. [2007] proposed an active learning algorithm with dimension-independent guarantees
under a margin assumption. These guarantees hold for a restricted class of data distributions.

Lastly, we compare the greedy approach of our algorithm to other previously proposed active learning
strategies, both theoretically and experimentally. We underscore cases in which it can be beneficial
to apply an aggressive greedy approach instead of a more mellow selective-sampling approach,
but also show that each approach has advantages and disadvantages in different cases. The empirical
evaluation indicates that our algorithm, which can be implemented in practice, achieves
state-of-the-art results. It further suggests that aggressive approaches can be better than mellow
approaches in some practical settings as well. In this work we do not treat the case of labeling
errors; this is left for future work.

1 We discuss the challenges presented by other natural choices of $P$ in Section 5.


The rest of the paper is organized as follows. We start in Section 2 with a formal statement of our
main results, and present our algorithm and its analysis in Section 3. We give a proof sketch of our
main theorem in Section 4. Proofs for the rest of the results are deferred to the appendix. In
Section 5 we consider other possible solutions to the problem of efficient greedy learning of
halfspaces, and discuss the challenges they present. Our comparison to other algorithms is given in
Section 6.



2 Main Results 
Let $P$ be a distribution over a hypothesis class $\mathcal{H}$. Given the pool
$X = \{x_1, \ldots, x_m\}$, and some $h \in \mathcal{H}$, denote by $P(h)$ the probability mass of the
set $\{h' \in \mathcal{H} : \forall i,\ h'(x_i) = h(x_i)\}$. Let $V_t \subseteq \mathcal{H}$ be the
version space of an active learner after $t$ queries. For a given pool example $x \in X$, denote by
$V^j_{t,x}$ the version space that would result if the algorithm now queried $x$ and received label
$j$. An algorithm $A$ is called $\alpha$-approximately greedy with respect to $P$, for
$\alpha \geq 1$, if at each iteration $t = 1, \ldots, T$, the pool example $x$ that $A$ decides to
query satisfies

$$P(V^{1}_{t,x})\,P(V^{-1}_{t,x}) \;\geq\; \frac{1}{\alpha} \max_{\tilde{x} \in X} P(V^{1}_{t,\tilde{x}})\,P(V^{-1}_{t,\tilde{x}}),$$

and the algorithm's output is $(h(x_1), \ldots, h(x_m))$ for some $h \in V_T$. An algorithm is exactly
greedy if it is approximately greedy with $\alpha = 1$.
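The greedy condition is easy to check numerically once the split masses are known. The following is a minimal sketch, assuming the masses $P(V^{1}_{t,x})$ and $P(V^{-1}_{t,x})$ are given as a dictionary; the function names and the example numbers are hypothetical.

```python
def split_score(p_pos, p_neg):
    """Product of the masses of the two version spaces produced by a query."""
    return p_pos * p_neg

def is_alpha_greedy_choice(splits, chosen, alpha):
    """Check the alpha-approximate greedy condition for one iteration.

    splits maps each pool example to (P(V^{+1}_{t,x}), P(V^{-1}_{t,x})).
    """
    best = max(split_score(p, q) for p, q in splits.values())
    p, q = splits[chosen]
    return split_score(p, q) >= best / alpha

# Hypothetical masses for three candidate queries.
splits = {"x1": (0.5, 0.5), "x2": (0.9, 0.1), "x3": (0.7, 0.3)}
exact_ok = is_alpha_greedy_choice(splits, "x1", alpha=1)    # best split
exact_bad = is_alpha_greedy_choice(splits, "x3", alpha=1)   # not the best split
approx_ok = is_alpha_greedy_choice(splits, "x2", alpha=4)   # within a factor of 4
approx_bad = is_alpha_greedy_choice(splits, "x2", alpha=2)  # not within a factor of 2
```

Only `x1` is an exactly greedy choice here, while a 4-approximately greedy algorithm may also query the fairly unbalanced `x2`.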

We say that $A$ outputs an approximate majority vote if whenever $V_T$ is "pure" enough, the algorithm
outputs the majority vote on $V_T$. Formally, $A$ outputs a $\beta$-approximate majority for
$\beta \in (\frac{1}{2}, 1)$ if whenever there exists a labeling $Z : X \to \{\pm 1\}$ such that
$\mathbb{P}_{h \sim P}[Z \Leftarrow h \mid h \in V_T] \geq \beta$, $A$ outputs $Z$. In the following
theorem we provide a target-dependent label complexity bound, which holds for any approximate greedy
algorithm that outputs an approximate majority vote.


Theorem 1 Let $X = \{x_1, \ldots, x_m\}$. Let $\mathcal{H}$ be a hypothesis class, and let $P$ be a
distribution over $\mathcal{H}$. Suppose that $A$ is $\alpha$-approximately greedy with respect to
$P$. Further suppose that it outputs a $\beta$-approximate majority vote. If $A$ is executed with input
$(X, L, T)$ where $L \Leftarrow h \in \mathcal{H}$ and
$T \geq \alpha\left(2\ln(1/P(h)) + \ln\frac{\beta}{1-\beta}\right) \cdot \mathrm{OPT}_{\max}$, then $A$
outputs $L(1), \ldots, L(m)$.
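To get a feeling for the size of the query budget in Theorem 1, the bound $\alpha(2\ln(1/P(h)) + \ln\frac{\beta}{1-\beta}) \cdot \mathrm{OPT}_{\max}$ can be evaluated directly. The helper below is a minimal sketch; the function name and the numeric inputs are illustrative assumptions, not values from the paper.

```python
import math

def query_budget(alpha, beta, p_h, opt_max):
    """Smallest integer T satisfying the condition of Theorem 1."""
    if not (alpha >= 1 and 0.5 < beta < 1 and 0 < p_h <= 1):
        raise ValueError("need alpha >= 1, beta in (1/2, 1), P(h) in (0, 1]")
    bound = alpha * (2 * math.log(1 / p_h) + math.log(beta / (1 - beta))) * opt_max
    return math.ceil(bound)

# Hypothetical instance: alpha = 4, beta = 2/3, P(h) = 0.01, OPT_max = 5.
budget = query_budget(alpha=4, beta=2 / 3, p_h=0.01, opt_max=5)
# A target with larger probability mass needs a smaller budget.
budget_easy = query_budget(alpha=4, beta=2 / 3, p_h=0.25, opt_max=5)
```

The budget grows only logarithmically in $1/P(h)$, which is what makes the bound useful when the target hypothesis occupies a non-negligible part of the hypothesis space.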

Denote $p_{\min} = \min_{h \in \mathcal{H}} P(h)$. When $P(h) \gg p_{\min}$, the bound in Theorem 1 is
stronger than the guarantee $\forall h \in \mathcal{H},\ N(A, h) \leq O(\log(1/p_{\min}) \cdot
\mathrm{OPT}_{\max})$, obtained by Golovin and Krause [2010]. Importantly, the following theorem shows
that this improved bound cannot be obtained for a general approximate-greedy algorithm, even in a
simple case such as the problem of thresholds on the line. In this setting, the examples are in
$[0, 1]$, and the hypothesis class includes all the hypotheses defined by a threshold on $[0, 1]$.
Formally, $\mathcal{H}_{\mathrm{line}} = \{h_c \mid c \in [0, 1],\ h_c(x) = 1 \Leftrightarrow x \geq c\}$.²

Theorem 2 Consider pool-based active learning on $\mathcal{H}_{\mathrm{line}}$, and assume that $P$ on
$\mathcal{H}_{\mathrm{line}}$ selects $h_c$ by drawing the value $c$ uniformly from $[0, 1]$. For any
$\alpha > 1$ there exists an $\alpha$-approximately greedy algorithm $A$ such that for any $m > 0$
there exists a pool $X \subseteq [0, 1]$ of size $m$, and a threshold $c$ such that $P(h_c) = 1/2$,
while the label complexity of $A$ for $L \Leftarrow h_c$ is
$\frac{m}{\lceil \log(m) \rceil} \cdot \mathrm{OPT}_{\max}$.

Interestingly, this theorem does not hold for $\alpha = 1$, that is, for the exact greedy algorithm.
This follows from Theorem 7, which we state and prove in Section 6.

So far we have considered a general hypothesis class. We now discuss the class of halfspaces
$\mathcal{W}$. Recall that each halfspace can be described using a vector $w$ from the unit ball of
$\mathbb{R}^d$. For simplicity, we will slightly overload notation and sometimes use $w$ to denote the
halfspace it determines. We let $P$ be the distribution that selects a vector $w$ uniformly from the
unit ball in $\mathbb{R}^d$. Our algorithm for halfspaces, which is called ALuMA, is described in
Section 3. ALuMA receives as input an extra parameter $\delta \in (0, 1)$, which serves as a measure of
the desired confidence level. The following lemma shows that ALuMA has the desired properties described
above with high probability.³

2 This setting is isomorphic to the case of homogeneous classifiers with examples on a line in $\mathbb{R}^2$.

3 The definition of $\mathrm{OPT}_{\max}$ requires an algorithm that always succeeds, while we allow
ALuMA to fail with small probability. This restriction on $\mathrm{OPT}_{\max}$ is made for
convenience; the same approximation factor can be achieved when the optimal algorithm is allowed the
same randomization power as ALuMA (see Appendix B).
Lemma 3 If ALuMA is executed with confidence $\delta$, then with probability $1 - \delta$ over its
internal randomization, ALuMA is 4-approximately greedy and outputs a $2/3$-approximate majority vote.
Furthermore, ALuMA is polynomial in the pool size, the dimension, and $\log(1/\delta)$.



Combining the above lemma with Theorem 1 we immediately obtain that ALuMA's label complexity is
$O(\log(1/P(h)) \cdot \mathrm{OPT}_{\max})$. We can upper-bound $\log(1/P(h))$ using the familiar
notion of margin: For any hypothesis $h \in \mathcal{W}$ defined by some $w \in \mathbb{R}^d$, let
$\gamma(h)$ be the maximal margin of the labeling of $X$ by $h$, namely
$\gamma(h) = \max_{v : \|v\| = 1} \min_{i \in [m]} h(x_i) \langle v, x_i \rangle$. It is possible to
show (see Lemma 9 in Appendix A.2) that $P(h) \geq (\gamma(h)/2)^d$. As a corollary we obtain:



Theorem 4 Let $X = \{x_1, \ldots, x_m\} \subseteq B_1^d$, where $B_1^d$ is the unit Euclidean ball of
$\mathbb{R}^d$. Let $\delta \in (0, 1)$ be a confidence parameter. Suppose that ALuMA is executed with
input $(X, L, T, \delta)$, where $L \Leftarrow h \in \mathcal{W}$ and
$T \geq 4(2d \ln(2/\gamma(h)) + \ln(2)) \cdot \mathrm{OPT}_{\max}$. Then, with probability of at least
$1 - \delta$ over ALuMA's own randomization, it outputs $L(1), \ldots, L(m)$.

We can consider the minimal possible margin, $\gamma = \min_{h \in \mathcal{W}} \gamma(h)$, and deduce
from Theorem 4, or from the results of Golovin and Krause [2010], a uniform approximation factor of
$O(d \log(1/\gamma))$. How small can $\gamma$ be? The following result bounds this minimal margin from
below under the reasonable assumption that the examples are represented by numbers of a finite
accuracy.

Lemma 5 Let $c > 0$ be such that $1/c$ is an integer and suppose that
$X \subseteq \{-1, -1 + c, \ldots, 1 - c, 1\}^d$. Then,
$\min_{h \in \mathcal{W}} \gamma(h) \geq (c/\sqrt{d})^{d+2}$.

The proof, given in Appendix A.3, is an adaptation of a classic result due to Muroga et al. [1961].
We conclude that $p_{\min} = \Omega((c/d)^{d^2})$, and deduce an approximation factor of
$d^2 \log(d/c)$ for the worst-case label complexity of ALuMA. The exponential dependence of the minimal
margin on $d$ here is necessary: As shown in Håstad [1994], the minimal margin can indeed be
exponentially small, even if the points are taken only from $\{\pm 1\}^d$.
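The step from Lemma 5 to the $d^2 \log(d/c)$ factor can be spelled out by substituting the margin bound into the budget of Theorem 4. The following worked derivation is a sketch of that arithmetic, using only $\sqrt{d} \leq d$ and $c \leq 1$:

```latex
\begin{align*}
\gamma(h) \ge \left(\frac{c}{\sqrt{d}}\right)^{d+2}
  &\;\Longrightarrow\;
  \ln\frac{2}{\gamma(h)} \le \ln 2 + (d+2)\ln\frac{\sqrt{d}}{c}, \\
4\Bigl(2d\ln\frac{2}{\gamma(h)} + \ln 2\Bigr)
  &\le 4\Bigl(2d(d+2)\ln\frac{\sqrt{d}}{c} + (2d+1)\ln 2\Bigr)
  = O\bigl(d^{2}\log(d/c)\bigr).
\end{align*}
```

The quadratic dependence on $d$ thus comes from two sources: one factor of $d$ from the volume bound in terms of the margin, and another from the exponent in Lemma 5.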

We also derive a lower bound, showing that the dependence of our bounds on $\gamma$ or on $c$ is
necessary. Whether the dependence on $d$ is also necessary is an open question for future work.

Theorem 6 Fix $\alpha > 1$. For any $c > 0$ with $1/c$ an integer, there exists a pool from
$\{-1, -1 + c, \ldots, 1 - c, 1\}^2$, for which $\min_{h \in \mathcal{W}} \gamma(h) = c/2$, and any
$\alpha$-approximately greedy algorithm that outputs a majority vote will require
$\Omega(\ldots \cdot \mathrm{OPT}_{\max})$ labels.

3 The ALuMA algorithm 

We now describe our algorithm, listed below as Alg. 1, and explain why Lemma 3 holds. We name
the algorithm Active Learning under a Margin Assumption or ALuMA. Its inputs are the unlabeled
sample $X$, the labeling oracle $L$, the maximal allowed number of label queries $T$, and the desired
confidence $\delta \in (0, 1)$. It returns the labels of all the examples in $X$. Two building blocks
are required in order to implement an algorithm with the desired guarantees for halfspaces. First, we
need to be able to select a pool example that approximately maximizes
$P(V^{1}_{t,x}) \cdot P(V^{-1}_{t,x})$. Second, we need to be able to output the majority vote of a
version space that has a high enough purity level.

For the first building block, we need to calculate the volumes of the sets $V^{1}_{t,x}$ and
$V^{-1}_{t,x}$. Both of these sets are convex sets obtained by intersecting the unit ball with
halfspaces. The problem of calculating the volume of such convex sets in $\mathbb{R}^d$ is #P-hard if
$d$ is not fixed [Brightwell and Winkler, 1991]. Moreover, deterministically approximating the volume
is NP-hard in the general case [Matoušek, 2002]. Luckily, it is possible to approximate this volume
using randomization. Specifically, in Kannan et al. [1997] a randomized algorithm is provided such
that for any convex body $K \subseteq \mathbb{R}^d$ with an efficient separation oracle, with
probability at least $1 - \delta$ the algorithm returns a non-negative number $\Gamma$ such that
$(1 - \epsilon)\Gamma \leq P(K) \leq (1 + \epsilon)\Gamma$. The algorithm is polynomial in
$d$, $1/\epsilon$, $\ln(1/\delta)$. ALuMA uses this algorithm to estimate $P(V^{1}_{t,x})$ and
$P(V^{-1}_{t,x})$ with sufficient accuracy. Using the guarantees of this algorithm and the constants
in ALuMA, we can show that ALuMA is 4-approximately greedy with probability $1 - \delta/2$. We denote
an execution of this algorithm on a convex body $K$ by
$\Gamma \leftarrow \mathrm{VolEst}(K, \epsilon, \delta)$.
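For intuition about what VolEst is asked to return, the quantity $P(K)$ can be approximated in low dimension by naive Monte Carlo sampling from the unit ball. This is explicitly not the algorithm of Kannan et al. [1997]: the sketch below is an assumption-laden stand-in whose sample requirement grows like $1/P(K)$, so it is only usable for toy problems. All function names are hypothetical.

```python
import math
import random

def sample_unit_ball(d, rng):
    """Uniform point in the unit Euclidean ball: Gaussian direction, radius U^(1/d)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    r = rng.random() ** (1.0 / d)
    return [r * v / norm for v in g]

def mc_volume_fraction(constraints, d, n_samples, rng):
    """Estimate P(K) for K = unit ball intersected with {w : y * <w, x> > 0} over all (x, y)."""
    hits = 0
    for _ in range(n_samples):
        w = sample_unit_ball(d, rng)
        if all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in constraints):
            hits += 1
    return hits / n_samples

rng = random.Random(0)
# Two labeled examples in the plane; the consistent region is the positive quadrant of the disc.
constraints = [([1.0, 0.0], 1), ([0.0, 1.0], 1)]
estimate = mc_volume_fraction(constraints, d=2, n_samples=20000, rng=rng)
```

In this example the true mass is $1/4$, and the estimate concentrates around it; in high dimension or for small version spaces a polynomial-time scheme such as the one cited above is required instead.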
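Algorithm 1 The ALuMA algorithm

Input: $X = \{x_1, \ldots, x_m\}$, $L : [m] \to \{-1, 1\}$, $T$, $\delta$
$I_1 \leftarrow [m]$, $V_1 \leftarrow B_1^d$
for $t = 1$ to $T$ do
    $\forall i \in I_t,\ j \in \{\pm 1\}$, do $v_{x_i,j} \leftarrow \mathrm{VolEst}(V^{j}_{t,x_i}, \frac{1}{3}, \frac{\delta}{4mT})$
    Select $i_t \in \operatorname{argmax}_{i \in I_t} (v_{x_i,1} \cdot v_{x_i,-1})$
    $I_{t+1} \leftarrow I_t \setminus \{i_t\}$
    Request $y = L(i_t)$
    $V_{t+1} \leftarrow V_t \cap \{w : y \langle w, x_{i_t} \rangle > 0\}$
end for
$M \leftarrow \lceil 72 \ln(2/\delta) \rceil$
Draw $w_1, \ldots, w_M$ $\frac{1}{12}$-uniformly from $V_{T+1}$
For each $x_i$ return the label $y_i = \mathrm{sgn}\left(\sum_{j=1}^{M} \mathrm{sgn}(\langle w_j, x_i \rangle)\right)$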
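The control flow of the listing can be illustrated with a toy version in which both VolEst and the approximately uniform sampler are replaced by one fixed Monte Carlo sample of hypotheses from the unit ball. This substitution is an assumption made for illustration only: it is not the paper's implementation, it carries none of its guarantees, and it scales poorly with the dimension. The names `toy_aluma`, `n_hyp`, and the example pool are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgn(z):
    return 1 if z > 0 else -1

def sample_unit_ball(d, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    r = rng.random() ** (1.0 / d)
    return [r * v / norm for v in g]

def toy_aluma(pool, oracle, budget, n_hyp=4000, seed=0):
    """Greedy volume splitting followed by a majority vote, on a sampled proxy of V_1."""
    rng = random.Random(seed)
    d = len(pool[0])
    version = [sample_unit_ball(d, rng) for _ in range(n_hyp)]  # proxy for the unit ball
    unqueried = set(range(len(pool)))
    for _ in range(budget):
        if not unqueried or not version:
            break

        def score(i):
            # Proxy for P(V^{+1}) * P(V^{-1}): product of the two sample counts.
            pos = sum(1 for w in version if dot(w, pool[i]) > 0)
            return pos * (len(version) - pos)

        i_t = max(sorted(unqueried), key=score)
        unqueried.remove(i_t)
        y = oracle(i_t)
        version = [w for w in version if y * dot(w, pool[i_t]) > 0]
    # Majority vote of the remaining sampled hypotheses on every pool point.
    return [sgn(sum(sgn(dot(w, x)) for w in version)) for x in pool]

angles = [30, 75, 120, 165, 210, 255, 300, 345]
pool = [[math.cos(math.radians(a)), math.sin(math.radians(a))] for a in angles]
w_target = [1.0, 0.0]  # hypothetical target halfspace
true_labels = [sgn(dot(w_target, x)) for x in pool]
calls = []

def oracle(i):
    calls.append(i)
    return true_labels[i]

predicted = toy_aluma(pool, oracle, budget=len(pool))
```

With the full budget every pool point is queried, so the vote necessarily agrees with the teacher; the interesting regime, which the analysis in the paper addresses, is a budget far smaller than the pool size.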



To output an approximate majority vote from the final version space $V$, we would like to uniformly
draw several hypotheses from $V$ and label $X$ according to a majority vote over these hypotheses.
We can efficiently draw a hypothesis approximately uniformly from $V$, by using the hit-and-run
algorithm [Lovász, 1999]. For a convex body $K$, the hit-and-run algorithm runs $O(d^3/\lambda^2)$
steps and returns a random sample according to a distribution which is $\lambda$-close in
total-variation distance to the uniform distribution over $K$. Using the constants in ALuMA, we can
show that ALuMA outputs a $2/3$-approximate majority with probability $1 - \delta/2$.
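A basic hit-and-run walk for a version space of the form "unit ball intersected with halfspaces" can be sketched as follows. The sketch assumes a starting point strictly inside the body and makes no claim about mixing time or about the number of steps used in the paper; the function names and the example constraints are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def chord(w, u, constraints):
    """Interval of t for which w + t*u lies in the unit ball and satisfies y*<., x> >= 0."""
    # Unit ball, with |u| = 1: t^2 + 2t<w,u> + |w|^2 - 1 <= 0.
    b = dot(w, u)
    root = math.sqrt(max(b * b - (dot(w, w) - 1.0), 0.0))
    t_lo, t_hi = -b - root, -b + root
    for x, y in constraints:
        slope = y * dot(u, x)  # change of the constraint value along the line
        slack = y * dot(w, x)  # current constraint value, positive inside the body
        if slope > 0:
            t_lo = max(t_lo, -slack / slope)
        elif slope < 0:
            t_hi = min(t_hi, -slack / slope)
    return t_lo, t_hi

def hit_and_run(w0, constraints, n_steps, rng):
    """Run n_steps of hit-and-run from the interior point w0 and return the final point."""
    w = list(w0)
    d = len(w)
    for _ in range(n_steps):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(dot(g, g))
        u = [v / norm for v in g]          # uniformly random direction
        t_lo, t_hi = chord(w, u, constraints)
        t = rng.uniform(t_lo, t_hi)        # uniform point on the chord
        w = [wi + t * ui for wi, ui in zip(w, u)]
    return w

rng = random.Random(0)
constraints = [([1.0, 0.0, 0.0], 1), ([0.0, 1.0, 0.0], -1)]  # two labeled examples
w_final = hit_and_run([0.3, -0.3, 0.0], constraints, n_steps=200, rng=rng)
```

Each step only needs the chord of the current line through the body, which is why the walk is cheap to run on version spaces given by linear constraints.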



4 Proof Sketch for Theorem 1

We give here the main steps in the proof of Theorem 1. Fix a pool $X$. For any algorithm alg, denote
by $V_n(\mathrm{alg}, h)$ the version space induced by the first $n$ labels it queries if the true
labeling of the pool is consistent with $h$. Denote the average version space reduction of alg after
$n$ queries by

$$f_{\mathrm{avg}}(\mathrm{alg}, n) = 1 - \mathbb{E}_{h \sim P}[P(V_n(\mathrm{alg}, h))].$$

Golovin and Krause [2010] prove that since $A$ is $\alpha$-approximately greedy, for any pool-based
algorithm alg,

$$f_{\mathrm{avg}}(A, n) \geq f_{\mathrm{avg}}(\mathrm{alg}, k) - \exp(-n/\alpha k). \qquad (1)$$

Let opt be an algorithm that achieves $\mathrm{OPT}_{\max}$. It can be shown that for any hypothesis
$h \in \mathcal{H}$ and any active learner alg,

$$f_{\mathrm{avg}}(\mathrm{opt}, \mathrm{OPT}_{\max}) - f_{\mathrm{avg}}(\mathrm{alg}, n) \geq P(h)\,(P(V_n(\mathrm{alg}, h)) - P(h)).$$

Combining this with Equation (1) we conclude that if $A$ is $\alpha$-approximately greedy then

$$\frac{P(h)}{P(V_n(A, h))} \geq \frac{P(h)^2}{\exp\left(-\frac{n}{\alpha\,\mathrm{OPT}_{\max}}\right) + P(h)^2}.$$

This means that if $P(h)$ is large enough and we run an approximate greedy policy, then after a
sufficient number of iterations, most of the remaining version space induces the correct labeling of
the sample. Specifically, if
$n \geq \alpha\left(2\ln(1/P(h)) + \ln\frac{\beta}{1-\beta}\right) \cdot \mathrm{OPT}_{\max}$, then
$P(h)/P(V_n(A, h)) \geq \beta$. Since $A$ outputs a $\beta$-approximate majority labeling from
$V_n(A, h)$, $A$ returns the correct labeling.
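The threshold on $n$ follows from requiring the lower bound on $P(h)/P(V_n(A, h))$, namely $P(h)^2 / (\exp(-n/(\alpha\,\mathrm{OPT}_{\max})) + P(h)^2)$, to be at least $\beta$. The algebra, written out as a short worked derivation:

```latex
\begin{align*}
\frac{P(h)^2}{\exp\!\bigl(-\frac{n}{\alpha\,\mathrm{OPT}_{\max}}\bigr) + P(h)^2} \ge \beta
  &\iff \exp\!\Bigl(-\frac{n}{\alpha\,\mathrm{OPT}_{\max}}\Bigr)
        \le \frac{1-\beta}{\beta}\,P(h)^2 \\
  &\iff n \ge \alpha\Bigl(2\ln\frac{1}{P(h)} + \ln\frac{\beta}{1-\beta}\Bigr)
        \cdot \mathrm{OPT}_{\max}.
\end{align*}
```

For $\beta = 2/3$ the second logarithm equals $\ln 2$, which is the constant that appears in the budget of Theorem 4.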

5 On the difficulties in greedy active-learning for halfspaces 

At first glance it might seem that there are simpler ways to implement an efficient greedy strategy
for halfspaces, by using a different distribution $P$ over the hypotheses. For instance, if there are
$m$ examples in $d$ dimensions, Sauer's lemma states that the effective size of the hypothesis class of
halfspaces will be at most $m^d$. One can thus use the uniform distribution over this finite class, and
greedily reduce the number of possible hypotheses in the version space, obtaining a $d \log(m)$ factor
relative to the optimal label complexity. However, a direct implementation of this method will be
exponential in $d$, and it is not clear whether this approach has a polynomial implementation.
Another approach is to discretize the version space, by considering only halfspaces that can be
represented as vectors on a $d$-dimensional grid $\{-1, -1 + c, \ldots, 1 - c, 1\}^d$. This results in
a finite hypothesis class of size $(2/c + 1)^d$, and we get an approximation factor of
$O(d \log(1/c))$ for the greedy algorithm, compared to an optimal algorithm on the same finite class.
However, it is not clear whether a greedy algorithm for reducing the number of such vectors in a
version space can be implemented efficiently, since even determining whether a single grid point exists
in a given version space is NP-hard [see e.g. Matoušek, 2002, Section 2.2].

Yet another possible direction for pool-based active learning is to greedily select a query whose
answer would determine the labels of the largest number of pool examples. The main challenge in
this direction is how to analyze the label complexity of such an algorithm: it is unclear whether
competitiveness with OPT can be guaranteed in this case. Investigating this idea, both theoretically
and experimentally, is an important topic for future work. Note that the CAL algorithm [Cohn et al.,
1994], which we discuss in Section 6, can be seen as implementing a mellow version of this
approach, since it decreases the so-called "disagreement region" in each iteration.

Inspecting our margin-dependent guarantees, one may wonder if a margin assumption alone can
guarantee that $\mathrm{OPT}_{\max}$ is small. This is not the case, as evidenced by the following
example, which is an adaptation of an example from Dasgupta [2005].



Example 1 Let $\gamma \in (0, \frac{1}{2})$ be a margin parameter. Consider a pool of $m$ points in
$\mathbb{R}^d$, such that all the points are on the unit sphere, and for each pair of points $x_1$ and
$x_2$, $\langle x_1, x_2 \rangle \leq 1 - 2\gamma$. It was shown in Shannon [1959] that for any
$m \leq O(1/\gamma^d)$, there exists a set of points that satisfy the conditions above. For any point
$x$ in such a pool, there exists a (biased) halfspace that separates $x$ from the rest of the points
with a margin of $\gamma$. By adding a single dimension, this example can be transformed to one with
homogeneous (unbiased) halfspaces. Each point in this pool can be separated from the rest of the points
by a halfspace. Thus, if the correct labeling is all-positive, then all $m$ examples need to be queried
to label the pool correctly. Therefore $\mathrm{OPT}_{\max} = m$.



6 Other Approaches: Theoretical and Empirical Comparison 

We now compare the effectiveness of the approach implemented by ALuMA to other active learning 
strategies. ALuMA can be characterized by two properties: (1) its "objective" is to reduce the 
volume of the version space and (2) at each iteration, it aggressively selects an example from the pool 
so as to (approximately) minimize its objective as much as possible (in a greedy sense). We discuss 
the implications of these properties by comparing to other strategies. Property (1) is contrasted with 
strategies that focus on increasing the number of examples whose label is known. Property (2) is 
contrasted with strategies which are "mellow", in that their criterion for querying examples is softer. 

Much research has been devoted to the challenge of obtaining a substantial guaranteed improvement
of label complexity over regular "passive" learning for halfspaces in $\mathbb{R}^d$. Examples (for the
realizable case) include the Query By Committee (QBC) algorithm [Seung et al., 1992, Freund et al.,
1997], the CAL algorithm [Cohn et al., 1994], and the Active Perceptron [Dasgupta et al., 2005].
These algorithms are not "pool-based" but rather use "selective-sampling": they sample one example
at each iteration, and immediately decide whether to ask for its label. Out of these algorithms,
CAL is the most mellow, since it queries any example whose label is yet undetermined by the version
space. Its "objective" can be described as reducing the number of examples which are labeled
incorrectly, since it has been shown to do so in many cases [Hanneke, 2007, 2009, Friedman, 2009].
QBC and the Active Perceptron are less mellow. Their "objective" is similar to that of ALuMA since
they decide on examples to query based on geometric considerations.

In the first part of our comparison, we discuss the theoretical advantages and disadvantages of dif- 
ferent strategies, by considering some interesting cases from a theoretical perspective. In the second 
part we report an empirical comparison of several algorithms and discuss our conclusions. 

6.1 Theoretical Comparison 

The label complexity of the algorithms mentioned above is usually analyzed in the PAC setting, thus
we translate our guarantees into the PAC setting as well for the sake of comparison. We define the
$(\epsilon, m, D)$-label complexity of an active learning algorithm to be the number of label queries
that are required in order to guarantee that given a sample of $m$ unlabeled examples drawn from $D$,
the error of the learned classifier will be at most $\epsilon$ (with probability of at least
$1 - \delta$ over the choice of sample). A pool-based active learner can be used to learn a classifier
in the PAC model by first sampling a pool of $m$ unlabeled examples from $D$, then applying the
pool-based active learner to this pool, and finally running a standard passive learner on the labeled
pool to obtain a classifier. For the class of halfspaces, if we sample an unlabeled pool of
$m = \tilde{\Omega}(d/\epsilon)$ examples, then the learned classifier will have an error of at most
$\epsilon$ (with high probability over the choice of the pool).

To demonstrate the effect of the first property discussed above, consider again the simple case of
thresholds on the line defined in Section 2. Compare two greedy pool-based active learners for
$\mathcal{H}_{\mathrm{line}}$: The first follows a binary search procedure, greedily selecting the
example that increases the number of known labels the most. Such an algorithm requires $\log(m)$
queries to identify the correct labeling of the pool. The second algorithm queries the example that
splits the version space as evenly as possible. Theorem 1 implies a label complexity of
$O(\log(m) \log(1/\gamma(h)))$ for such an algorithm, since $\mathrm{OPT}_{\max} = \log(m)$. However, a
better result holds for this simple case:

Theorem 7 In the problem of thresholds on the line, for any pool with labeling $L$, the exact greedy
algorithm requires at most $O(\log(1/\gamma(h)))$ labels. This is also the label complexity of any
approximate greedy algorithm that outputs a majority vote.

Comparing the $\log(m)$ guarantee of the first algorithm to the $\log(1/\gamma(h))$ guarantee of the
second, we reach the (unsurprising) conclusion that the first algorithm is preferable when the true
labeling has a small margin, while the second is preferable when the true labeling has a large margin.
This simple example accentuates the implications of selecting the volume of the version space as
an objective. A similar implication can be derived by considering the PAC setting, replacing the
binary-search algorithm with CAL, and letting $m = \Theta(1/\epsilon)$. On the single-dimensional line,
CAL achieves a label complexity of $O(\log(1/\epsilon)) = O(\log(m))$, similarly to the binary search
strategy we described. Thus when $\epsilon$ is large compared to $\gamma(h)$, CAL is better than being
greedy on the volume, and the opposite holds when the condition is reversed. QBC will behave similarly
to ALuMA in this setting.
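The contrast between the two greedy objectives on the line can be reproduced with a few lines of code. The sketch below assumes pool points strictly inside $(0, 1)$ and the convention $h_c(x) = 1 \Leftrightarrow x \geq c$; the pool and the threshold are hypothetical and were chosen so that the target has a large margin.

```python
def label(c, x):
    """Threshold hypothesis h_c: positive exactly when x >= c."""
    return 1 if x >= c else -1

def binary_search_queries(points, c):
    """Queries used by binary search over the sorted pool (maximizes known labels)."""
    pts = sorted(points)
    lo, hi, queries = 0, len(pts), 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if label(c, pts[mid]) == 1:
            hi = mid
        else:
            lo = mid + 1
    return queries

def volume_greedy_queries(points, c):
    """Queries used by exact greedy splitting of the interval of consistent thresholds."""
    lo, hi, queries = 0.0, 1.0, 0
    while True:
        undetermined = [x for x in points if lo < x < hi]
        if not undetermined:
            return queries
        # The most even split of (lo, hi) maximizes the product of the two parts.
        x = max(undetermined, key=lambda p: (p - lo) * (hi - p))
        queries += 1
        if label(c, x) == 1:
            hi = x
        else:
            lo = x

# 100 tightly packed negative points and one far positive point: a large-margin target.
pool = [k / 1000 for k in range(1, 101)] + [0.8]
n_binary = binary_search_queries(pool, c=0.5)
n_volume = volume_greedy_queries(pool, c=0.5)
```

On this pool the volume-based learner needs two queries, one on each side of the wide gap, while binary search pays the usual logarithmic price in the pool size; packing many points close to the threshold reverses the comparison.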

To demonstrate the effect of the second property described above, being aggressive versus being
mellow, we consider the following example, adapted slightly from [Dasgupta, 2006].

Example 2 Consider two circles parallel to the $(x, y)$ plane in $\mathbb{R}^3$, one at the origin and
one slightly above it. For a given $\epsilon$, fix $2/\epsilon$ points that are evenly distributed on
the top circle, and $2/\epsilon$ points at the same angles on the bottom circle (see left illustration
below). The distribution $D_\epsilon$ is an uneven mix of a uniform distribution over the points on the
top circle and one over the points of the bottom circle: The top circle is given a much higher
probability. All homogeneous separators label half of the bottom circle positively, but an unknown part
of the top circle (see right illustration). The bottom points can be very helpful in finding the
correct separator fast, but their probability is low.



Dasgupta has demonstrated via this example that active learning can gain in label complexity from 
having significantly more unlabeled data. The following theorem shows that the aggressive strategy 
employed by ALuMA indeed achieves an exponential improvement when there are more unlabeled 
samples. In contrast, the mellow strategy of CAL does not significantly improve over passive learn- 
ing in this case. We note that these results hold for any selective-sampling method that guarantees a 
similar error rate to passive ERM given the same sample size. 

Theorem 8 For all small enough $\epsilon \in (0, 1)$ there is a distribution $D_\epsilon$ of points in
$\mathbb{R}^3$, such that

• For $m = O(1/\epsilon)$, the $(\epsilon, m, D_\epsilon)$-label complexity of any active learner is $\Omega(1/\epsilon)$.

• For $m = \Omega(\log^2(1/\epsilon)/\epsilon^2)$, the $(\epsilon, m, D_\epsilon)$-label complexity of ALuMA is $O(\log^2(1/\epsilon))$.

• For any value of $m$, the $(\epsilon, m, D_\epsilon)$-label complexity of CAL is $\Omega(1/\epsilon)$.

In many applications, unlabeled examples are virtually free to sample, thus it can be worthwhile to
allow the active learner to sample more examples than the passive sample complexity.⁴ We show in
Appendix C another example in which the label complexity of CAL can be significantly worse than
that of the optimal algorithm, even without more unlabeled examples. These examples strengthen
the observation of Balcan et al. [2007] that in some cases a more aggressive approach is preferable.

On the other hand, CAL has a guaranteed label complexity for cases for which ALuMA currently has
none. Its label complexity is bounded by $O(d\theta \log(1/\epsilon))$, where $\theta$ is the
disagreement coefficient, a quantity that depends on the distribution and the target hypothesis
[Hanneke, 2007, 2009]. Specifically, if $D$ is uniform over a sphere centered at the origin, then for
all target hypotheses $\theta = \Theta(\sqrt{d})$. Thus CAL achieves an exponential improvement over
passive learning for this canonical example.



Figure 1: MNIST (3 vs. 5) 

6.2 Empirical Comparison

We carried out an empirical comparison between the algorithms discussed above. Our goal is twofold:
First, to evaluate ALuMA in practice, and second, to compare the performance of aggressive strategies
to that of mellow strategies. The aggressive strategies are represented in this evaluation by ALuMA
and one of the heuristics proposed by Tong and Koller [2002] (TK). The mellow strategy is represented
by CAL. QBC represents a middle ground between aggressive and mellow. We also compare to a passive
ERM algorithm, one that uses random labeled examples. We evaluated the algorithms over synthetic and
real data sets and compared their label complexity performance. Details on our implementations and
additional results are provided in Appendix D.

In the first experiment the data consisted of the digits 3 and 5 from the MNIST dataset. Figure 1 depicts the
training error as a function of the label budget. It is striking to observe that CAL provides no
improvement over passive ERM in the first 1000 examples, while this budget suffices to reach zero
training error for ALuMA and TK. More results for MNIST and for another real data set can be found
in Appendix D. In these experiments, the more aggressive learners consistently perform better.

The next experiment shows that ALuMA and TK outperform CAL and QBC even on data sampled
from the uniform distribution on a sphere in $\mathbb{R}^d$ (see Figure 2). This result, and a similar one reported
in the appendix, suggest that ALuMA might have a better guarantee than the general competitive
result in the case of the uniform distribution. This is an open question which is left for future work.







Figure 2: Uniform distribution (d = 100).

To summarize, in all of our tests, aggressive algorithms performed better than mellow ones. These
results are not fully explained by current theory. The experiments also show that ALuMA and TK
have comparable success in practice. One might hope that TK enjoys similar theoretical guarantees
to those of ALuMA. In Appendix D we report a synthetic experiment that suggests the contrary.

⁴ In the limit of an infinite number of unlabeled examples, if the distribution has a non-zero support on the
entire domain, the pool-based setting becomes identical to the setting of membership queries [Angluin, 1988].
In contrast, we are interested in finite samples.
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Supplementary Material for NIPS Submission 
"Efficient Pool-Based Active Learning of Halfspaces" 

A Proofs 

A.1 Proof of Theorem 2

For the hypothesis class $\mathcal{H}_{\mathrm{line}}$, the possible version spaces after a partial run of an active learner are
all of the form $[a, b] \subseteq [0, 1]$. First, it is easy to see that binary search on the pool will identify any
hypothesis in $[0, 1]$ using $\log(m)$ examples, thus $\mathrm{OPT}_{\max} = \log(m)$.
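
The binary-search argument can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a threshold hypothesis labels a point positive exactly when the point is at or above the threshold; the names `label_pool_by_binary_search` and `oracle` are hypothetical and not part of the paper. A pool of $m$ points admits $m + 1$ threshold labelings, so the sketch needs at most $\lceil \log_2(m+1) \rceil$ queries.

```python
import math

def label_pool_by_binary_search(pool, oracle):
    """Label every point of a 1-D pool under a threshold hypothesis.

    Assumes (illustrative convention) that the oracle returns +1 for
    points at or above an unknown threshold and -1 otherwise.
    Returns (labels aligned with the input order, number of queries).
    """
    order = sorted(range(len(pool)), key=lambda i: pool[i])
    lo, hi = 0, len(order)      # rank of the first positive point lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(pool[order[mid]]) == 1:
            hi = mid            # first positive is at mid or to its left
        else:
            lo = mid + 1        # first positive is strictly to the right
    labels = [0] * len(pool)
    for rank, i in enumerate(order):
        labels[i] = 1 if rank >= lo else -1
    return labels, queries


pool = [i / 16 for i in range(16)]
oracle = lambda x: 1 if x >= 0.3 else -1   # hidden threshold, for the demo only
labels, queries = label_pool_by_binary_search(pool, oracle)
```

Each query halves the range of candidate positions for the first positive point, which is where the logarithmic query count comes from.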

Now, consider an active learning algorithm that satisfies the following properties:

• If the current version space is $[a, b]$, it selects $x \in X$ such that
\[ x = \min\Big\{x \;\Big|\; (x-a)(b-x) \ge \frac{1}{\alpha} \max_{\tilde{x} \in X \cap [a,b]} (\tilde{x}-a)(b-\tilde{x})\Big\}. \]

• When the budget of queries is exhausted, if the version space is $[a, b]$, label the points above
$a$ as positive and the rest as negative.

It is easy to see that this algorithm is $\alpha$-approximately greedy, since in this problem the product of
the probability masses of the two version spaces induced by the possible labels of $x$ equals
$(x-a)(b-x)$ for all $x \in [a, b] = V_t$. Now for a given pool size $m > 2$, consider a pool of examples
defined as follows. First, let $x_1 = 1$, $x_2 = 1/2$ and $x_3 = 0$. Second, for each $i \ge 3$, define $x_{i+1}$
recursively as the solution to
\[ (x_{i+1} - x_i)(1 - x_{i+1}) = \frac{1}{\alpha}\Big(\frac{1}{2} - x_i\Big)\Big(1 - \frac{1}{2}\Big) = \frac{1}{2\alpha}\Big(\frac{1}{2} - x_i\Big). \]

Since $\alpha > 1$, it is easy to see by induction that for all $i \ge 3$, $x_{i+1} \in (x_i, \frac{1}{2})$. Furthermore,
suppose the true labeling is induced by $h_{3/4}$. Thus the only pool example with a positive label
is $x_1$, and $P(h_{3/4}) = 1/2$. In this case, the algorithm we just defined will query all the pool
examples $x_4, x_5, \ldots, x_m$ in order, and only then will it query $x_2$ and finally $x_1$. If stopped at any
time $t < m - 1$, it will label all the points that it has not queried yet as positive, thus if $t < m - 1$ the
output will be an erroneous labeling. Finally, note that the same holds for the pool $x_1, x_2, x_4, \ldots, x_m$
that does not include $x_3$, so the algorithm must query this entire pool to identify the correct labeling.
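
The construction can be checked numerically with the following Python sketch, offered as an illustration under stated assumptions rather than as the paper's implementation. The function names `build_pool` and `query_order`, the tolerance `tol`, and the convention that a point is positive when it is at or above the target threshold are all assumptions made here. The sketch only considers pool points strictly inside the current version space and stops when none remain, so it reproduces the order $x_4, \ldots, x_m, x_2$ but not the final query of $x_1$ described in the proof.

```python
import math

def build_pool(m, alpha):
    """Pool x_1 = 1, x_2 = 1/2, x_3 = 0 and, for i >= 3, x_{i+1} in (x_i, 1/2)
    solving (x_{i+1} - x_i)(1 - x_{i+1}) = (1/2 - x_i) / (2 * alpha)."""
    xs = [1.0, 0.5, 0.0]
    while len(xs) < m:
        xi = xs[-1]
        c = (0.5 - xi) / (2.0 * alpha)
        s = 1.0 + xi
        # Smaller root of y^2 - (1 + xi) * y + (xi + c) = 0.
        xs.append((s - math.sqrt(s * s - 4.0 * (xi + c))) / 2.0)
    return xs

def query_order(pool, alpha, threshold, tol=1e-9):
    """0-based indices queried by the alpha-approximately greedy rule."""
    a, b = 0.0, 1.0
    order = []
    while True:
        inside = [i for i, x in enumerate(pool) if a < x < b]
        if not inside:
            return order
        score = lambda i: (pool[i] - a) * (b - pool[i])
        best = max(score(i) for i in inside)
        # Smallest x whose score is within a factor alpha of the best score
        # (tol absorbs floating-point rounding at exact equality).
        ok = [i for i in inside if score(i) >= (best / alpha) * (1.0 - tol)]
        i = min(ok, key=lambda j: pool[j])
        order.append(i)
        if pool[i] >= threshold:
            b = pool[i]     # positive answer: threshold is at or below x
        else:
            a = pool[i]     # negative answer: threshold is above x
        
m, alpha = 10, 2.0
pool = build_pool(m, alpha)
order = query_order(pool, alpha, threshold=0.75)
```

With 0-based indexing, index 3 corresponds to $x_4$ and index 1 to $x_2$, so the returned order walks through the whole increasing sequence before reaching the point that an exact binary search would have queried first.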



A.2 Proof of lower bound for p u 



Lemma 9 For all $h \in \mathcal{H}$, $P(h) \ge \left(\frac{\gamma}{2}\right)^d$.

Proof Let $V = \{h' \in \mathcal{H} : \forall i,\ h'(x_i) = h(x_i)\}$. Choose $w \in B_1^d$ such that $\forall x \in X$, $h(x)\langle w, x\rangle \ge \gamma$.
For a given $v \in B_1^d$, denote by $h_v \in \mathcal{H}$ the mapping $x \mapsto \mathrm{sgn}(\langle v, x\rangle)$. Note that for all $v \in B_1^d$
such that $\|w - v\| < \gamma$, $h_v \in V$. This is because for all $x \in X$,
\[ h(x)\langle v, x\rangle = \langle v - w, h(x) \cdot x\rangle + h(x)\langle w, x\rangle \ge -\|w - v\| \cdot \|h(x) \cdot x\| + \gamma > -\gamma + \gamma = 0, \]
which implies $\mathrm{sgn}(\langle v, x\rangle) = h(x)$. It follows that $\{v \mid h_v \in V\} \supseteq B_1^d \cap B(w, \gamma)$, where $B(z, r)$
denotes the ball of radius $r$ with center at $z$. Let $u = (1 - \gamma/2)w$. Then for any $z \in B(u, \gamma/2)$, we
have $z \in B_1^d$, since
\[ \|z\| = \|z - u + u\| \le \|z - u\| + \|u\| \le \gamma/2 + 1 - \gamma/2 = 1. \]
In addition, $z \in B(w, \gamma)$ since
\[ \|z - w\| = \|z - u + u - w\| \le \|z - u\| + \|u - w\| \le \gamma/2 + \gamma/2 = \gamma. \]
Therefore $B(u, \gamma/2) \subseteq B_1^d \cap B(w, \gamma)$. We conclude that $\{v \mid h_v \in V\} \supseteq B(u, \gamma/2)$. Thus,
\[ P(h) = \Pr[V] \ge \mathrm{Vol}(B(u, \gamma/2)) / \mathrm{Vol}(B_1^d) \ge \left(\frac{\gamma}{2}\right)^d. \]

A.3 Proof of Lemma 5

Let us multiply all examples in the pool by $1/c$. Then, all the elements of all examples in the pool
are integers. Choose a labeling $L$ which is consistent with some $w^*$. Consider the optimization
problem:
\[ \min_w \|w\|^2 \quad \text{s.t.} \quad \forall i,\ L(i)\langle w, x_i\rangle \ge 1. \]

For simplicity assume that the pool of examples spans all of $\mathbb{R}^d$. Then, it is easy to show that if $w$ is
the solution to the above problem then there exist $d$ linearly independent examples from the pool,
denoted w.l.o.g. by $x_1, \ldots, x_d$, such that $L(i)\langle w, x_i\rangle = 1$ for all $i$. In other words, $w$ is the solution
of the linear system $Aw = b$ where the rows of $A$ are $x_1, \ldots, x_d$ and $b = (L(1), \ldots, L(d))^T$.

By Cramer's rule, $w_i = \det(A_i)/\det(A)$, where $A_i$ is obtained by replacing column $i$ of $A$ by the
vector $b$. Since all elements of $A$ are integers and $A$ is invertible, we must have that $|\det(A)| \ge 1$.
Therefore, $|w_i| \le |\det(A_i)|$. Furthermore, by Hadamard's inequality, $|\det(A_i)|$ is upper bounded by
the product of the norms of the columns of $A_i$. Since each element of $A_i$ is upper bounded by $1/c$,
we obtain that the norm of each column is at most $\sqrt{d}/c$, hence $|\det(A_i)| \le (\sqrt{d}/c)^d$. It follows that
$\|w\| \le \sqrt{d}(\sqrt{d}/c)^d$. Hence, the margin is
\[ \frac{1}{\|w\| \cdot \max_i \|x_i\|} \ge \frac{1}{\sqrt{d}(\sqrt{d}/c)^d \cdot \sqrt{d}/c} = \frac{1}{\sqrt{d}(\sqrt{d}/c)^{d+1}}. \]

A.4 Proof of Theorem©

For any $m$, consider the pool $X = \{(-1, 1)\} \cup \{(1 - 2^{-k}, 1) : k = 0, 1, \ldots, m - 2\} \subseteq \mathbb{R}^2$,
illustrated below.

By following a binary search, the optimal algorithm can identify all the labels using $\log(m)$ queries.
In contrast, it is easy to verify that the exact greedy algorithm will query all the points if all the
examples but the rightmost are negative. We therefore obtain an approximation factor of $m/\log(m)$
on the label complexity compared to the optimal algorithm. Since the points lie on the grid with
$c = 2^{-(m-2)}$, we obtain that the approximation factor is of order $\log(1/c)/\log\log(1/c)$. It is
also easy to verify that the margin here is of order $c$. For an approximately-greedy strategy, the
example can be adapted by replacing $1 - 2^{-k}$ with $1 - a^{-k}$ for $a \approx \alpha$, to make sure that the
approximately-greedy strategy will query all the points in the same case. Thus in this case we get
$m \approx \log(1/c)/\log(a)$.



A.5 Proof of Theorem^]

First, assume that the algorithm is exactly greedy. A version space for $\mathcal{H}_{\mathrm{line}}$ is described by a
segment $[a, b] \subseteq [0, 1]$, and a query at point $\alpha$ results in a new version space, $[a, \alpha]$ or $[\alpha, b)$,
depending on the label. We now show that for every version space $[a, b]$, at most two greedy queries
suffice to either shrink the version space to at most $2/3$ of its size, or to determine the
labels of all the points in the pool.



Assume for simplicity that the version space is $[0, 1]$, and denote the pool of examples in the version
space by $X$. Assume w.l.o.g. that the greedy algorithm now queries $\alpha \le \frac{1}{2}$. If $\alpha \ge 1/3$, then any
answer to the query will reduce the version space size to at most $2/3$. Thus assume that $\alpha < 1/3$.
If the query answer results in the version space $[0, \alpha)$ then we are done since this version space is
smaller than $2/3$. We are left with the case that the version space after querying $\alpha$ is $[\alpha, 1]$. Since
the algorithm is greedy, it follows that for $\beta = \min\{x \in X \mid x > \alpha\}$, we have $\beta \ge 1 - \alpha$: this is
because if there was a point $\beta \in (\alpha, 1 - \alpha)$, it would cut the version space more evenly than $\alpha$, in
contradiction to the greedy choice of $\alpha$. Note further that $(\alpha, 1 - \alpha)$ is larger than $[1 - \alpha, 1]$ since
$\alpha < 1/3$. Therefore, the most balanced choice for the greedy algorithm is $\beta$. If the query answer for
$\beta$ cuts the version space to $(\beta, 1]$ then we are done, since $1 - \beta \le \alpha < 1/3$. Otherwise, the query
answer leaves us with the version space $(\alpha, \beta)$. This version space includes no more pool points, by
the definition of $\beta$. Thus in this case the algorithm has determined the labels of all points.

Figure 3: Illustration for the proof of Theorem 8

It follows that if the algorithm runs at least $t$ iterations, then the size of the version space after $t$
iterations is at most $(2/3)^{t/2}$. If the true labeling has a margin of $\gamma$, we conclude that $(2/3)^{t/2} \ge \gamma$,
thus $t \le O(\log(1/\gamma))$.

A similar argument can be carried out for ALuMA, using a smaller bound on $\alpha$ and more iterations due
to the approximation, and noting that if the correct answer is in $(\alpha, 1 - \alpha)$ then a majority vote over
thresholds drawn randomly from the version space will label the examples correctly.

A.6 Proof of Theorem 8

Proof Assume that $1/(2\epsilon)$ is an odd integer and $\epsilon \le 1/8$. Let $D_a$ be the uniform distribution over
points on the top circle, defined by
\[ S_a = \Big\{a_n \triangleq \Big(\tfrac{1}{\sqrt{2}} \cos 2\pi\epsilon n,\ \tfrac{1}{\sqrt{2}} \sin 2\pi\epsilon n,\ \tfrac{1}{\sqrt{2}}\Big) : n \in \{0, 1, \ldots, 1/\epsilon - 1\}\Big\}. \]

Let $D_b$ be the uniform distribution over points on the bottom circle, defined by
\[ S_b = \{b_n \triangleq (\cos 2\pi\epsilon n,\ \sin 2\pi\epsilon n,\ 0) : n \in \{0, 1, \ldots, 1/\epsilon - 1\}\}. \]

Let $D_{\epsilon/2}$ be the distribution $(1 - \tau)D_a + \tau D_b$, where $\tau = \frac{\epsilon}{4\log(4/\epsilon)}$. Note that in order to label $D_{\epsilon/2}$
correctly with error no more than $\epsilon/2$, all the labels of points in $S_a$ need to be determined. We prove
each of the theorem statements in order. We consider the label complexity with high probability
over the choice of unlabeled sample, where high probability is $1 - \delta$ for some fixed $\delta \in (0, 1/2)$.

Part I If the unlabeled sample contains only points from $S_a$, then an active learner has to query
all the points in $S_a$ to distinguish between a hypothesis that labels all of $S_a$ positively and one that
labels positively all but one point in $S_a$. Since the probability of the entire set $S_b$ is $o(\epsilon)$, an i.i.d.
sample of size $O(1/\epsilon)$ will not contain a point from $S_b$, thus any active learner will require $\Omega(1/\epsilon)$
labels.

Part II Assume now that the size of the sample is at least $\frac{4\log(4/\epsilon)\log(1/(\epsilon\delta))}{\epsilon^2}$. It is easy to check
that with probability at least $1 - \delta$, the sample contains all the points in $S_a \cup S_b$. Given such a sample
as a pool, we now show that $\mathrm{OPT}_{\max} = O(\log(1/\epsilon))$, by describing an active learning algorithm
that achieves this label complexity:

1. For all possible separators, the points $b_0 = (1, 0, 0)$ and $b_{1/2\epsilon} = (-1, 0, 0)$ have different
labels. The algorithm will first query these initial points, and then apply a binary search to
find the boundary between negative and positive labels in $S_b$. This identifies the labels of
all the points in $S_b$ using $O(\log(1/\epsilon))$ queries.

2. Of the points in $S_b$, half are labeled positively and half negatively. Moreover, there are
$n_1$, $n_2$ and $y \in \{-1, 1\}$ such that $b_{n_1}, \ldots, b_{n_2}$ are all labeled by $y$, and $n_2 - n_1 + 1 =
|S_b|/2 = \frac{1}{2\epsilon}$ (see illustration in Figure 3). Let $n_3 = \frac{n_2 + n_1}{2}$ (this is the middle point with
label $y$). $n_3$ is an integer because $n_2 - n_1$ is even, thus their sum is also even. Let $n_4 =
\mathrm{mod}(n_3 + 1/2\epsilon,\ 1/\epsilon)$. Query the points $a_{n_3}$ and $a_{n_4}$ for their label.

3. If $a_{n_3}$ and $a_{n_4}$ each have a different label, apply a binary search starting from these points to
find the boundaries between positive and negative labels in $S_a$, using $O(\log(1/\epsilon))$ queries.
Otherwise, label all the examples in $S_a$ by the label of $a_{n_3}$.



This algorithm uses $O(\log(1/\epsilon))$ queries to label the sample. If $a_{n_3}$ and $a_{n_4}$ have different labels,
it is clear that the algorithm labels all the examples correctly. We only have left to prove that if
they both have the same label, then all the examples in $S_a$ also share that label. Let $h^*$ be the true
hypothesis, defined by some homogeneous separator, and assume w.l.o.g. that $\{b_n \mid h^*(b_n) = 1\} =
\{b_n \in S_b \mid b_n[1] > 0\}$ (note that no point has $b_n[1] = 0$ since $1/2\epsilon$ is odd). It follows that $n_3 = 0$
and $n_4 = 1/2\epsilon$, thus $a_{n_3} = (1/\sqrt{2}, 0, 1/\sqrt{2})$ and $a_{n_4} = (-1/\sqrt{2}, 0, 1/\sqrt{2})$ (see illustration in
Figure 3). We prove the following lemma below:

Lemma 10 Assume $1/2\epsilon$ is odd. If $\{b_n \in S_b \mid h^*(b_n) = 1\} = \{b_n \mid b_n[1] > 0\}$ and $h^*(a_0) =
h^*(a_{1/2\epsilon}) = y$ then $\forall a_n \in S_a$, $h^*(a_n) = y$.

It follows that $\mathrm{OPT}_{\max} = O(\log(1/\epsilon))$.

To bound the label complexity of ALuMA, it suffices to bound from below the minimal margin of
possible separators over the given sample. Let $h^*$ be the correct hypothesis. By the same argument
as in the proof of Lemma 5, there exists some $w \in \mathbb{R}^3$ that labels the sample identically to $h^*$
and attains its maximal margin on three linearly independent points $a, b, c$ from our sample. Hence,
$Aw = \mathbf{1}$ where $A \in \mathbb{R}^{3 \times 3}$ is the matrix whose rows are $a, b, c \in S_a \cup S_b$. By Cramer's rule, for
every $i \in [3]$,
\[ w_i = \frac{\det A_i}{\det A}, \]
where $A_i$ is the matrix obtained from $A$ by replacing the $i$-th column with the vector $\mathbf{1}$. Recall that the
absolute value of the determinant of $A$ is the volume of the parallelepiped whose sides are $a$, $b$ and
$c$. Since $a, b, c$ are linearly independent, each of $S_a$ and $S_b$ includes at most two of them. Assume
that $a, b \in S_a$ and $c \in S_b$. In this case, the surface area of the basis of this parallelepiped, defined
by $a$ and $b$, is at least $\frac{\sin 2\pi\epsilon}{2}$, and the height is $1/\sqrt{2}$. Hence,
\[ |\det A| \ge \frac{\sin 2\pi\epsilon}{2\sqrt{2}} = \Omega(\epsilon). \]
The case where two of the points are in $S_b$ leads to an even larger lower bound. Since the elements
in each $A_i$ are in $[-1, 1]$, we also have that $|\det A_i| \le 3! = 6$. Thus, for $i \in [3]$ we obtain that $w_i =
O(1/\epsilon)$. All in all, we get $\|w\|_2 = O(1/\epsilon)$, and thus $\gamma(h^*) = \Omega(\epsilon)$. Applying Theorem 4 we obtain
that ALuMA classifies all the points correctly using $O(\log(1/\gamma(h^*)) \cdot \mathrm{OPT}_{\max}) = O(\log^2(1/\epsilon))$
labels.

Part III CAL examines the examples sequentially in a random order, and queries the label of any
point whose label is not determined by previous examples. Thus, if the true hypothesis is all-positive
on $S_a$, and CAL sees all the points in $S_a$ before seeing any point in $S_b$, it will request $\Omega(1/\epsilon)$ labels.
Hence, it suffices to show that there is a large probability that CAL will indeed examine all of $S_a$
before examining any point from $S_b$. Let $A$ be the event that the first $\frac{1}{\epsilon}\log\frac{4}{\epsilon}$ examples of an i.i.d.
sample contain any element from $S_b$. Then, by the union bound, $\mathbb{P}(A) \le \frac{1}{\epsilon}\log(\frac{4}{\epsilon}) \cdot \frac{\epsilon}{4\log(4/\epsilon)} = 1/4$.
Assume now that $A$ does not occur. Let $B$ be the event that the first $\frac{1}{\epsilon}\log\frac{4}{\epsilon}$ examples do not contain
all the elements in $S_a$. Then, by the union bound, $\mathbb{P}(B) \le \frac{1}{\epsilon}(1 - \epsilon)^{\frac{1}{\epsilon}\log\frac{4}{\epsilon}} \le 1/4$. All in all, with
probability at least $1/2$, CAL sees all the points in $S_a$ before seeing any point in $S_b$ and thus its label
complexity is $\Omega(1/\epsilon)$. ■
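
The two union bounds of Part III can be evaluated numerically. The sketch below is only an illustration: it assumes that the logarithms are natural, that the weight of the bottom circle is $\epsilon / (4\log(4/\epsilon))$, and that the prefix length $\frac{1}{\epsilon}\log\frac{4}{\epsilon}$ may be treated as a real number. The name `part_three_bounds` is hypothetical.

```python
import math

def part_three_bounds(eps):
    """Union bounds on P(A) and P(B) from Part III (natural logarithm assumed)."""
    tau = eps / (4.0 * math.log(4.0 / eps))           # assumed mass of the bottom circle
    n_first = math.log(4.0 / eps) / eps               # length of the examined prefix
    bound_a = n_first * tau                           # some prefix example falls in S_b
    bound_b = (1.0 / eps) * (1.0 - eps) ** n_first    # some point of S_a is missed
    return bound_a, bound_b

# Values of eps for which 1/(2*eps) is an odd integer and eps <= 1/8.
bounds = {eps: part_three_bounds(eps) for eps in (1 / 10, 1 / 18, 1 / 202)}
```

The first bound equals $1/4$ by construction of the prefix length, and the second stays below $1/4$ because $(1-\epsilon)^{1/\epsilon} < e^{-1}$.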

Proof [of Lemma 10] We prove the lemma for the case $h^*(a_{1/2\epsilon}) = 1$. The case $h^*(a_0) = -1$ can
be proved similarly. Let $w^*$ be any hyperplane which is consistent with $h^*$. Let $n_1 = \frac{1}{4\epsilon} - \frac{1}{2}$ and let
$n_2 = n_1 + 1$. Then
\[ b_{n_1} = (\cos(\pi/2 - \pi\epsilon),\ \sin(\pi/2 - \pi\epsilon),\ 0), \quad \text{and} \quad b_{n_2} = (\cos(\pi/2 + \pi\epsilon),\ \sin(\pi/2 + \pi\epsilon),\ 0). \]

By the assumption of the lemma, $\langle w^*, b_{n_1}\rangle > 0$ and $\langle w^*, b_{n_2}\rangle < 0$. It follows that $w^*[1]\sin\pi\epsilon >
w^*[2]\cos\pi\epsilon$ and $-w^*[1]\sin\pi\epsilon < w^*[2]\cos\pi\epsilon$. As a consequence, we obtain that $|w^*[2]| <
w^*[1]\tan(\pi\epsilon)$.

Now, choose some $n \in \{0, \ldots, 1/\epsilon - 1\}$. We show that the corresponding element in $S_a$ is labeled
positively. First, from the last inequality, we obtain
\[ \langle w^*, a_n\rangle = \frac{1}{\sqrt{2}}\langle w^*, (\cos 2\pi\epsilon n,\ \sin 2\pi\epsilon n,\ 1)\rangle \ge \frac{1}{\sqrt{2}}\big(w^*[1](\cos 2\pi\epsilon n - \tan(\pi\epsilon)\sin(2\pi\epsilon n)) + w^*[3]\big). \quad (2) \]

We will now show that
\[ \forall n \in \{0, 1, \ldots, 1/\epsilon - 1\}, \quad \cos 2\pi\epsilon n - \tan(\pi\epsilon)\sin(2\pi\epsilon n) \ge -1. \quad (3) \]

From symmetry, it suffices to prove this for every $n \in \{0, 1, \ldots, 1/(2\epsilon) - 1\}$. We divide our
range and conclude for each part separately; since $\epsilon \le 1/8$, we have that $\tan \epsilon\pi \le 1$. Then,
$\cos\alpha - \tan(\pi\epsilon)\sin\alpha \ge -1$ in the range $\alpha \in [0, \pi/2]$. For $\alpha \in [\pi/2, \pi - \pi\epsilon]$, it can be shown
that the function $\cos\alpha - \tan(\pi\epsilon)\sin\alpha$ is monotonically decreasing, thus it suffices to show that the
inequality holds for $n = 1/(2\epsilon) - 1$. Indeed,
\[ \cos(\pi - 2\pi\epsilon) - \tan(\epsilon\pi)\sin(\pi - 2\epsilon\pi) = -\cos(2\pi\epsilon) - 2\sin^2(\pi\epsilon) = -\cos^2(\pi\epsilon) + \sin^2(\pi\epsilon) - 2\sin^2(\pi\epsilon) = -1. \]

Therefore, we obtain from Equation (2) and Equation (3) that
\[ \langle w^*, a_n\rangle \ge \frac{1}{\sqrt{2}}(-w^*[1] + w^*[3]) = \langle w^*, (-1/\sqrt{2},\ 0,\ 1/\sqrt{2})\rangle = \langle w^*, a_{1/2\epsilon}\rangle > 0, \]
where the last inequality follows from the assumption that $h^*(a_{1/2\epsilon}) = 1$.
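
As a sanity check on Equation (3), the left-hand side can be evaluated on the whole grid $n \in \{0, \ldots, 1/\epsilon - 1\}$. The following Python sketch is an illustration only; the function name `min_lhs` and the chosen values of $1/(2\epsilon)$ are assumptions made here, picked so that $1/(2\epsilon)$ is odd and $\epsilon \le 1/8$.

```python
import math

def min_lhs(half_inv_eps):
    """Minimum over n of cos(2*pi*eps*n) - tan(pi*eps) * sin(2*pi*eps*n),
    where eps = 1 / (2 * half_inv_eps) and n ranges over 0, ..., 1/eps - 1."""
    eps = 1.0 / (2 * half_inv_eps)
    return min(
        math.cos(2 * math.pi * eps * n)
        - math.tan(math.pi * eps) * math.sin(2 * math.pi * eps * n)
        for n in range(2 * half_inv_eps)
    )

# half_inv_eps plays the role of 1/(2*eps): odd, and at least 5 so that eps <= 1/8.
minima = {k: min_lhs(k) for k in (5, 9, 101)}
```

The minimum is attained at $n = 1/(2\epsilon) - 1$, where the expression equals $-1$ up to floating-point rounding, in line with the closing computation of the proof.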



B Randomization in the Optimal Algorithm 



Recall that ALuMA is allowed to use randomization, and it can fail to output the correct label with
probability $\delta$. In contrast, in the definition of $\mathrm{OPT}_{\max}$ we required that the optimal algorithm
always succeeds, in effect making it deterministic. One may suggest that the approximation factor
we achieve for ALuMA in Theorem 4 is due to this seeming advantage for ALuMA. We now show
that this is not the case: the same approximation factor can be achieved when ALuMA and the
optimal algorithm are allowed the same probability of failure. Let $m$ be the size of the pool and let
$d$ be the dimension of the examples, and set $\delta_0 = \frac{1}{2m^d}$. Denote by $N_\delta(A, h)$ the number of calls to
$L$ that $A$ makes before outputting $(L(x_1), \ldots, L(x_m))$ with probability at least $1 - \delta$, for $L \Leftarrow h$.
Define $\mathrm{OPT}_{\delta_0} = \min_A \max_h N_{\delta_0}(A, h)$.

First, note that by setting $\delta = \delta_0$ in ALuMA, we get that $N_{\delta_0}(\mathrm{ALuMA}, h) \le O(\log(1/P(h)) \cdot
\mathrm{OPT}_{\max})$. Moreover, ALuMA with $\delta = \delta_0$ is polynomial in $m$ and $d$ (since it is polynomial in
$\ln(1/\delta)$). Second, by Sauer's lemma there are at most $m^d$ different possible labelings for the given
pool. Thus by the union bound, there exists a fixed choice of the random bits used by an algorithm
that achieves $\mathrm{OPT}_{\delta_0}$, that leads to the correct identification of the labeling for all possible labelings
$L(1), \ldots, L(m)$. It follows that $\mathrm{OPT}_{\delta_0} = \mathrm{OPT}_{\max}$. Therefore the same factor of approximation
can be achieved for ALuMA with $\delta = \delta_0$, compared to $\mathrm{OPT}_{\delta_0}$.
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C Example: Gap between OPT and CAL 



Example 3 Consider a distribution in $\mathbb{R}^d$ that is supported by two types of points on an octahedron
(see an illustration for $\mathbb{R}^3$ below).

1. Vertices: $\{e_1, \ldots, e_d\}$.

2. Face centers: $z/d$ for $z \in \{-1, +1\}^d$.

Consider the hypothesis class $\mathcal{W} = \{x \mapsto \mathrm{sgn}(\langle x, w\rangle - 1 + \frac{1}{d}) \mid w \in \{-1, +1\}^d\}$. Each hypothesis
in $\mathcal{W}$, defined by some $w \in \{-1, +1\}^d$, classifies at most $d+1$ data points as positive: these are
the vertices $e_i$ for $i$ such that $w[i] = +1$, and the face center $w/d$.

[Figure: illustration of the octahedron example in $\mathbb{R}^3$]

Theorem 11 Consider Example 3 for $d \ge 3$, and assume that the pool of examples includes the
entire support of the distribution. There is an efficient algorithm that finds the correct hypothesis
from $\mathcal{W}$ with at most $d$ labels. On the other hand, with probability at least $\frac{1}{e}$ over the randomization
of the sample, CAL uses at least $\frac{2^d + d}{2d + 3}$ labels to find the correct separator.

Proof First, it is easy to see that if $h^* \in \mathcal{W}$ is the correct hypothesis, then
\[ w = (h^*(e_1), \ldots, h^*(e_d)). \]
Thus, it suffices to query the $d$ vertices to discover the true $w$.

We now show that the number of queries CAL asks until finding the correct separator is exponential
in $d$. CAL inspects the unlabeled examples sequentially, and queries any example whose label
cannot be inferred from previous labels. Consider some run of CAL (determined by the random
ordering of the sample). Assume w.l.o.g. that each data point appears once in the sample. Let $S$ be
the set that includes the positive face center and all the vertices. Note that CAL cannot terminate
before either querying all the $2^d - 1$ negative face centers, or querying at least one example from $S$.
Moreover, CAL will query all the face centers it encounters before encountering the first example
from $S$. At each iteration $t$ before encountering an example from $S$, there is a probability of $\frac{d+1}{2^d + d - t}$
that the next example is from $S$. Therefore, the probability that the first $T = \frac{2^d + d}{2d + 3}$ examples are not
from $S$ is
\[ \prod_{t=0}^{T-1}\Big(1 - \frac{d+1}{2^d + d - t}\Big) \ge \Big(1 - \frac{d+1}{2^d + d - T}\Big)^T \ge \exp\Big(-\frac{2T(d+1)}{2^d + d - T}\Big) = \frac{1}{e}, \]
where in the second inequality we used $1 - a \ge \exp(-2a)$, which holds for all $a \in [0, \frac{1}{2}]$. Therefore,
with probability at least $\frac{1}{e}$ the number of queries is at least $\frac{2^d + d}{2d + 3}$. ■
D Additional Experiments 

In this appendix we provide the details on our implementation of the algorithms in our experiments.
We also provide results of additional experiments comparing ALuMA, TK, CAL, QBC and passive
ERM. Our implementation of ALuMA uses hit-and-run samples instead of full-blown volume
estimation. QBC is also implemented using hit-and-run as in Gilad-Bachrach et al. [2005]. For both
ALuMA and QBC, we used a fixed number of mixing iterations for hit-and-run, which we set to
1000. We also fixed the number of sampled hypotheses at each iteration of ALuMA to 1000, and
used the same set of hypotheses to calculate the majority vote for classification. CAL and QBC
examine the examples sequentially, thus the input provided to them was a random ordering of the
example pool. The algorithm TK is the first heuristic proposed in Tong and Koller [2002], in which
the example chosen at each iteration is the one closest to the max-margin solution of the labeled
examples known so far. Since the active learners operate by reducing the training error, the graphs
below compare the training errors of the different algorithms. The test errors show a similar trend.

Figure 4: Left: MNIST 4 vs. 7. Right: PCMAC

In each of the algorithms, the classification of the training examples is done using the version space 
defined by the queried labels. The theory for CAL and ERM allows selecting an arbitrary predictor 
out of the version space. In QBC, the hypothesis should be drawn uniformly at random from the 
version space. We have found that all the algorithms show a significant improvement in classification 
error if they classify using the majority vote classification proposed for ALuMA. Therefore, in all of 
our experiments below, the results for all the algorithms are based on a majority vote classification. 
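
A majority vote over sampled hypotheses can be sketched as follows. This is a minimal illustration under assumptions: the hypotheses are homogeneous halfspaces given by weight vectors, the hard-coded vectors merely stand in for samples that a hit-and-run sampler would draw from the version space (the sampler itself is not implemented here), and ties are broken toward $+1$ arbitrarily. The name `majority_vote` is hypothetical.

```python
def majority_vote(hypotheses, x):
    """Label x by the majority of sgn(<w, x>) over sampled weight vectors w.

    Ties (in a single vote or in the total) are broken toward +1, an
    arbitrary choice made for this sketch.
    """
    votes = sum(
        1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        for w in hypotheses
    )
    return 1 if votes >= 0 else -1

# Stand-ins for weight vectors sampled from the version space.
sampled = [(1.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]
pool = [(1.0, 0.0), (-1.0, -1.0), (0.0, 1.0)]
labels = [majority_vote(sampled, x) for x in pool]
```

Using one fixed set of sampled hypotheses for all pool points, as described above for ALuMA, keeps the predicted labels mutually consistent across the pool.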

Our first data set is MNIST. The examples in this data set are gray-scale images of handwritten
digits in dimension 784. Each digit has about 6,000 training examples. We performed binary active
learning by pre-selecting pairs of digits. One experiment was already reported in Section 6.2. The
second experiment is reported in Figure 4 (left): in this case we trained the active learners on data
of the digits 4 and 7, which are linearly separable just like 3 and 5. The results are very similar
to those obtained for the digits 3 and 5.

We also tested the algorithms on the PCMAC dataset. This is a real-world data set, which
represents a two-class categorization of the 20-Newsgroup collection. The examples are web posts
represented using bag-of-words. The original dimension of the examples is 7511. We used the
Johnson-Lindenstrauss projection to reduce the dimension to 300, which kept the data separable. We
used a training set of 1000 examples. Figure 4 (right) depicts the results. We were not able to run
QBC long enough to use its entire label budget, as it tends to become slower when the training error
becomes small.

For the uniform distribution, Figure 5 depicts the training error as a function of the label budget
when learning a random halfspace over the uniform distribution in $\mathbb{R}^{10}$. The difference between the
performance of the different algorithms is less marked for d = 10 than for d = 100 (see Section 6.2),
suggesting that the difference grows with the dimension.



d     ALuMA   TK     QBC    CAL     ERM
10    29      156    50     308     1008
12    38      735    113    862     3958
15    55      959    150    2401    > 20000

Table 1: Octahedron: number of queries to achieve zero error



http://yann.lecun.com/exdb/mnist/ 



http://vikas.sindhwani.org/datasets/lskm/matlab/pcmac.mat 
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Figure 5: Uniform distribution (d = 10). 



Finally, we report a synthetic experiment. In this experiment the pool of examples is taken to be
the support of the distribution described in Example 3, with an additional dimension to account for
halfspaces with a bias. We also added the negative vertices $-e_i$ to the pool. Similarly to the proof of
Theorem 11, it suffices to query the vertices to reach zero error. Table 1 lists the number of iterations
required in practice to achieve zero error by each of the algorithms. In this experiment, unlike the
rest, ALuMA is not only much better than QBC and CAL, it is also much better than TK, which
is worse even than QBC here. This suggests that TK might not have guarantees similar to those
of ALuMA, despite the fact that they both attempt to minimize the same objective. The number of
queries ALuMA requires is indeed close to the number of vertices.
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